Parameter-Efficient Transformer Embeddings

Ndubuaku, Henry, Talhi, Mouad

arXiv.org Artificial Intelligence

Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings. Code for reproducing our results and pre-trained weights are available at https://github.com/HMUNACHI/pete.
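
As a concrete illustration of the idea, the following minimal Python sketch builds deterministic token embeddings from a Fourier expansion of normalized token IDs; the exact basis, normalization, and follow-up MLP used in the paper may differ, and all names here are illustrative.

import numpy as np

def fourier_token_embedding(token_ids, vocab_size, num_frequencies):
    # Normalize token IDs to [0, 1) and expand them in a sine/cosine basis.
    # A sketch of the general technique, not the paper's exact recipe.
    x = np.asarray(token_ids, dtype=np.float64) / vocab_size
    k = np.arange(1, num_frequencies + 1)          # frequencies 1..K
    angles = 2.0 * np.pi * np.outer(x, k)          # shape: (num_tokens, K)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (num_tokens, 2K)

# A small MLP would then map these 2K-dimensional vectors to the model
# dimension to capture higher-order interactions (omitted here).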


Byte BPE Tokenization as an Inverse String Homomorphism

Geng, Saibo, Gambhir, Sankalp, Wendler, Chris, West, Robert

arXiv.org Artificial Intelligence

Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural architectures used in LLMs, the impact of tokenization has not been well understood. In this work, we demonstrate that tokenization, irrespective of the algorithm used, acts as an inverse homomorphism between strings and tokens. This suggests that the character space of the source language and the token space of the tokenized language are homomorphic, preserving the structural properties of the source language. Additionally, we explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer. Our analysis reveals that the expressiveness of neural architectures in recognizing context-free languages is not affected by tokenization.
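
The homomorphism claim can be illustrated with a toy vocabulary (hypothetical, not a real BPE tokenizer): the detokenization map sends concatenation of token sequences to concatenation of their strings, so tokenization behaves as its inverse.

# phi: token sequences -> strings is a homomorphism: phi(a + b) == phi(a) + phi(b)
vocab = {0: "lo", 1: "w", 2: "er", 3: " "}

def detokenize(token_seq):
    return "".join(vocab[t] for t in token_seq)

a, b = [0, 1], [2, 3, 0, 1]
assert detokenize(a + b) == detokenize(a) + detokenize(b)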


On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

Xu, Kevin, Sato, Issei

arXiv.org Artificial Intelligence

However, the expressive power of Looped Transformers for function approximation, and their approximation rate, remains underexplored. In this paper, we establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture; to address it, we incorporate scaling parameters for each loop, conditioned on timestep encoding. Experimental results demonstrate that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding architecture.
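
A minimal PyTorch sketch of the timestep-encoding idea is shown below, assuming a single shared block looped a fixed number of times, with a per-loop scale and shift derived from a learned timestep embedding; the conditioning scheme and all names are illustrative, not the paper's exact architecture.

import torch
import torch.nn as nn

class LoopedEncoderWithTimestep(nn.Module):
    def __init__(self, d_model, nhead, num_loops):
        super().__init__()
        # One shared transformer block, reused at every loop iteration.
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_loops = num_loops
        # A learned embedding per loop index ("timestep"), mapped to scale/shift.
        self.timestep_emb = nn.Embedding(num_loops, d_model)
        self.to_scale_shift = nn.Linear(d_model, 2 * d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        for t in range(self.num_loops):
            scale, shift = self.to_scale_shift(self.timestep_emb.weight[t]).chunk(2, dim=-1)
            x = self.block(x * (1 + scale) + shift)   # timestep-conditioned modulation
        return x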


OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

Xue, Fuzhao, Zheng, Zian, Fu, Yao, Ni, Jinjie, Zheng, Zangwei, Zhou, Wangchunshu, You, Yang

arXiv.org Artificial Intelligence

To help the open-source community have a better understanding of Mixture-of-Experts (MoE) based large language models (LLMs), we train and release OpenMoE, a series of fully open-sourced and reproducible decoder-only MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T tokens. Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs, highlighting their potential for future LLM development. Another important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, leading to three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We discovered that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged. This imperfect routing can result in performance degradation, particularly in sequential tasks like multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design based on the above-mentioned observations and analysis. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.
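
The routing findings can be illustrated with a toy top-1 router (a sketch, not the released OpenMoE code): the expert choice below is a fixed function of the token ID, mimicking Context-Independent Specialization, and a per-expert capacity limit causes tokens later in the sequence to be dropped, mimicking Drop-towards-the-End.

import numpy as np

def toy_route(token_ids, num_experts, capacity):
    load = np.zeros(num_experts, dtype=int)
    assignments = []
    for pos, tok in enumerate(token_ids):
        expert = tok % num_experts            # depends only on the token ID, not the context
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append((pos, expert))
        else:
            assignments.append((pos, None))   # expert full: later tokens get dropped
    return assignments

print(toy_route([5, 13, 5, 21, 5, 5], num_experts=4, capacity=2))
# [(0, 1), (1, 1), (2, None), (3, None), (4, None), (5, None)]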


Memory Augmented Language Models through Mixture of Word Experts

Santos, Cicero Nogueira dos, Lee-Thorp, James, Noble, Isaac, Chang, Chung-Ching, Uthus, David

arXiv.org Artificial Intelligence

Scaling up the number of parameters of language models has proven to be an effective approach to improve performance. For dense models, increasing model size proportionally increases the model's computation footprint. In this work, we seek to aggressively decouple learning capacity and FLOPs through Mixture-of-Experts (MoE) style models with large knowledge-rich vocabulary based routing functions and experts. Our proposed approach, dubbed Mixture of Word Experts (MoWE), can be seen as a memory augmented model, where a large set of word-specific experts play the role of a sparse memory. We demonstrate that MoWE performs significantly better than the T5 family of models with a similar number of FLOPs on a variety of NLP tasks. Additionally, MoWE outperforms regular MoE models on knowledge intensive tasks and has similar performance to more complex memory augmented approaches that often require custom mechanisms to search the sparse memory.
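
A heavily simplified PyTorch sketch of the word-expert-as-sparse-memory idea follows, assuming routing is a plain lookup keyed by the token ID and that each vocabulary entry owns a single memory vector rather than a full MLP expert; names and sizes are illustrative, not the paper's implementation.

import torch
import torch.nn as nn

class WordExpertMemory(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        # One "expert" slot per vocabulary entry, acting as a sparse memory.
        self.memory = nn.Embedding(vocab_size, d_model)

    def forward(self, hidden, token_ids):        # hidden: (batch, seq, d_model)
        # Only the slots of tokens actually present are read, so the extra
        # capacity adds parameters but very few FLOPs per token.
        return hidden + self.memory(token_ids)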


Position Masking for Language Models

Wagner, Andy, Mitra, Tiyasa, Iyer, Mrinal, Da Costa, Godfrey, Tremblay, Marc

arXiv.org Machine Learning

Masked language modeling (MLM) pre-training models such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. This is an effective technique which has led to good results on many NLP benchmarks. We propose to expand upon this idea by masking the positions of some tokens along with the masked input token IDs. We follow the same standard approach as BERT, masking a percentage of the token positions and then predicting their original values using an additional fully connected classifier stage. This approach has shown good performance gains (0.3% improvement) on SQuAD, along with additional improvement in convergence times. On the Graphcore IPU, the convergence of BERT Base with position masking requires only 50% of the tokens from the original BERT paper.
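
A minimal data-preparation sketch of position masking is given below, assuming positions are masked jointly with the corresponding token IDs and that ignored labels follow the common -100 convention; the mask values, masking rate, and the separate classifier heads are illustrative rather than the paper's exact setup.

import random

MASK_TOKEN_ID = 103       # e.g. BERT's [MASK] id; assumed for illustration
MASK_POSITION_ID = 0      # placeholder position id; the paper's choice may differ

def mask_tokens_and_positions(token_ids, mask_prob=0.15):
    masked_tokens, masked_positions = [], []
    token_labels, position_labels = [], []
    for pos, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            masked_tokens.append(MASK_TOKEN_ID)         # corrupt the token id
            masked_positions.append(MASK_POSITION_ID)   # corrupt its position id too
            token_labels.append(tok)                    # target for the MLM head
            position_labels.append(pos)                 # target for the extra position head
        else:
            masked_tokens.append(tok)
            masked_positions.append(pos)
            token_labels.append(-100)                   # ignored by the loss
            position_labels.append(-100)
    return masked_tokens, masked_positions, token_labels, position_labels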